Crawl Web (Web Mining)

Synopsis

This operator crawls the web and stores the retrieved links and pages in an ExampleSet or on disk.

Description

The crawler starts at the specified URL, loads pages, and follows links according to the crawling rules. There are three types of rules, each applied in a different situation (see the sketch after this list):

  • store_with_matching_url: If the regular expression matches the URL, this page will be stored in the resulting ExampleSet and on disk (if selected).
  • store_with_matching_content: If the page content contains the given term, this page will be stored in the resulting ExampleSet. Note: Using this filter will slow down crawling a lot! Also note that this is NOT a regular expression but a simple contains filter.
  • follow_link_with_matching_url: If the regular expression matches the URL, the crawler will follow the link and load the URL.
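To make the rule semantics concrete, here is a minimal Python sketch (not RapidMiner's implementation; the rule-tuple format and the helper function are assumptions for illustration) of how a page and its links might be evaluated against these three rule types:

```python
import re

def apply_rules(url, content, links, rules):
    """Decide whether to store a page and which of its links to follow.

    rules is assumed to be a list of (rule_type, pattern) tuples,
    mirroring the three rule types of this operator.
    """
    store = False
    follow = []
    for rule_type, pattern in rules:
        if rule_type == "store_with_matching_url" and re.search(pattern, url):
            store = True
        elif rule_type == "store_with_matching_content" and pattern in content:
            store = True  # plain substring test, NOT a regular expression
        elif rule_type == "follow_link_with_matching_url":
            follow += [link for link in links if re.search(pattern, link)]
    return store, follow
```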

To avoid crawling a potentially unlimited number of pages, the maximum number of pages and the crawl depth can be limited with the max pages and max depth parameters. Lowering the delay speeds up crawling, but please be friendly to web site owners and avoid causing heavy traffic on their sites; otherwise you may get blacklisted. Note that while crawling makes use of your available CPU cores (license limits apply), crawling speed is usually limited by your bandwidth, disk I/O (if applicable), the crawling delay, and the fact that this crawler is benign and queries robots.txt for each page it visits.
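The interplay of max pages, max depth, the delay, and robots.txt can be sketched as a simple breadth-first loop. The snippet below illustrates this behavior only and is not the operator's actual code; fetch_links is a hypothetical callback that returns the links found on a page:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

def polite_crawl(start_url, fetch_links, max_pages=100, max_depth=2, delay_ms=1000):
    """Breadth-first crawl that honors robots.txt, a page limit,
    a depth limit, and a fixed delay between requests."""
    robots = {}  # cache one robots.txt parser per host

    def allowed(url):
        host = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if host not in robots:
            parser = urllib.robotparser.RobotFileParser(host + "/robots.txt")
            parser.read()
            robots[host] = parser
        return robots[host].can_fetch("MyCrawler", url)

    visited, frontier = set(), [(start_url, 0)]
    while frontier and len(visited) < max_pages:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth or not allowed(url):
            continue
        time.sleep(delay_ms / 1000.0)  # be friendly: throttle requests
        visited.add(url)
        for link in fetch_links(url):
            frontier.append((link, depth + 1))
    return visited
```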

Leave the ignore robot exclusion parameter unchecked unless you are crawling your own sites. Some site owners forbid crawling of their content, and for legal reasons you may be bound to respect their wishes.

Output

  • Example Set (Data Table)

    The example set port which returns the crawling results.

Parameters

  • url: The root page from which the crawler will start.
  • crawling_rules: Specifies a set of rules that determine which links to follow and which pages to process.
  • retrieve_as_html: If selected, the actual HTML is returned instead of a textual representation.
  • enable_basic_auth: If selected, all requests will send basic auth information in their headers. Use only when crawling HTTPS pages!
  • username: The username for basic authentication.
  • password: The password for basic authentication.
  • add_content_as_attribute: Specifies whether the pages' content should be added as a text attribute.
  • write_pages_to_disk: Specifies whether the crawled pages should be saved as files.
  • include_binary_content: If selected, the crawler will also consider binary content instead of only text pages. This can be useful, for example, to download all .pdf files from a web site via the crawling rules parameter (see the example after this list).
  • output_dir: Specifies the directory on disk into which the files are written if write pages to disk is selected.
  • output_file_extension: Specifies the file extension of the stored files.
  • max_crawl_depth: Specifies the maximal depth of the crawling process. A depth of 1 means 'only crawl direct links on the initial page'.
  • max_pages: The maximal number of pages to store.
  • max_page_size: Specifies the maximum page size (in KB); pages larger than this limit are not downloaded.
  • delay: Specifies the delay between page visits in milliseconds.
  • max_concurrent_connections: The maximum number of HTTP connections used at the same time.
  • max_connections_per_host: The maximum number of simultaneous HTTP connections to a single host. Increasing this parameter can put heavy load on a host, so please be careful!
  • user_agent: The identity the crawler uses while accessing a server.
  • ignore_robot_exclusion: Specifies whether the crawler should ignore the robot exclusion rules set by the crawled site. Enable only for your own sites; otherwise you may end up violating laws!
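As a concrete illustration of the crawling rules and include_binary_content parameters, the following rule set (hypothetical patterns; example.com stands in for a real site) would follow links within one domain and store only PDF files:

```python
# Hypothetical rules: follow links inside example.com and store only
# URLs ending in .pdf (include_binary_content must be selected so the
# binary PDF payload is actually downloaded).
rules = [
    ("follow_link_with_matching_url", r"https?://(www\.)?example\.com/.*"),
    ("store_with_matching_url", r".*\.pdf$"),
]
```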